This report analyzes the quality of white wines and how chemical factors such as acidity, sugar, pH, and alcohol content affect it. The dataset covers about 4,900 wines, each with 11 input variables taken from physicochemical tests and one output variable, quality, based on sensory data.
## [1] 4898 13
##  X              fixed.acidity    volatile.acidity   citric.acid
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800     Min.   :0.0000
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100     1st Qu.:0.2700
##  Median :2450   Median : 6.800   Median :0.2600     Median :0.3200
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782     Mean   :0.3342
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200     3rd Qu.:0.3900
##  Max.   :4898   Max.   :14.200   Max.   :1.1000     Max.   :1.6600
##  residual.sugar   chlorides         free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00
##  Median : 5.200   Median :0.04300   Median : 34.00
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00
##  total.sulfur.dioxide   density          pH              sulphates
##  Min.   :  9.0          Min.   :0.9871   Min.   :2.720   Min.   :0.2200
##  1st Qu.:108.0          1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100
##  Median :134.0          Median :0.9937   Median :3.180   Median :0.4700
##  Mean   :138.4          Mean   :0.9940   Mean   :3.188   Mean   :0.4898
##  3rd Qu.:167.0          3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500
##  Max.   :440.0          Max.   :1.0390   Max.   :3.820   Max.   :1.0800
##  alcohol         quality
##  Min.   : 8.00   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.40   Median :6.000
##  Mean   :10.51   Mean   :5.878
##  3rd Qu.:11.40   3rd Qu.:6.000
##  Max.   :14.20   Max.   :9.000
The dataset contains 4,898 observations and 13 columns: an index column, X, which simply numbers the observations, plus 12 variables.
Taking a look at our output variable, quality, we can see that we do not have many points below 5 or above 7. We do not have any values of 0, 1, 2, or 10.
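One way to make the missing scores explicit is to tabulate quality against the full 0 to 10 scale. The sketch below uses a short made-up vector as a stand-in for the quality column (an assumption for illustration, not the real data) and base R only.

```r
# Hypothetical stand-in for wines$quality; the values are illustrative only.
quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 9)

# Treat quality as a factor over the full 0-10 scale so that
# scores which never occur show up with a count of zero.
quality_counts <- table(factor(quality, levels = 0:10))
print(quality_counts)

# Scores that never occur in the sample
unused_scores <- names(quality_counts)[as.vector(quality_counts) == 0]
print(unused_scores)
```

Run against the real column, the same two lines would list every score on the scale that no wine received.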
From the input variables present, alcohol content, as a percentage, is the one most people are familiar with. The middle half of the wines falls between 9.5% and 11.4% alcohol, with a median of 10.4%.
The first graph didn’t seem to tell enough of a story, so we take the second graph and use a binwidth of 5. It looks as if most wines have residual sugar levels below 20 g/L. We do have a few above 20 g/L and one above 60 g/L.
Fixed acidity, volatile acidity, and citric acid are all right skewed, and all three have what look to be outliers on the higher end. There seems to be a spike in the number of wines with a citric acid level just below 0.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
This would definitely be something to look at in the future: why are there so many wines with a citric acid level of 0.49 (215) when the neighboring levels of 0.48 (39) and 0.5 (35) are so much less common?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
We see that most of the chloride values lie between 0.025 and 0.075. However, a long tail extends well above the 3rd quartile (0.050), up to a maximum of 0.346. We can transform our chlorides to investigate.
From here, we get a clearer picture of where the values lie.
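The report does not say which transformation was applied. A log10 transform is a common choice for a right-skewed variable, so the sketch below assumes that. The `chlorides` vector is a synthetic log-normal stand-in (not the real measurements), and `skew_gap` is a hypothetical helper that measures how far the mean sits from the median.

```r
# Synthetic, deterministic right-skewed stand-in for wines$chlorides.
chlorides <- qlnorm(ppoints(500), meanlog = log(0.043), sdlog = 0.35)

# Assumed transformation: log base 10
log_chlorides <- log10(chlorides)

# Hypothetical helper: (mean - median) / sd is near 0 for a symmetric
# distribution and positive for a right-skewed one.
skew_gap <- function(x) (mean(x) - median(x)) / sd(x)
print(skew_gap(chlorides))
print(skew_gap(log_chlorides))

# Histogram on the transformed scale
hist(log_chlorides, breaks = 30,
     main = "Chlorides (log10 scale)", xlab = "log10(chlorides)")
```

On the log scale the long upper tail is compressed, which is why the bulk of the values becomes easier to read.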
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
We can see that free.sulfur.dioxide follows a right-skewed distribution with a peak centered around 36. We see that total.sulfur.dioxide is also right skewed, with a peak centered around 135. There is a secondary peak at 150, so this may be a point we should look at. In our comparison graph, we can see a slight overlap between free and total sulfur dioxide. This is interesting because it shows how adding ‘bound’ sulfur dioxide to free sulfur dioxide stretches out the graph of total.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Sulphates appear to be distributed around 0.5. There are a few points past 1.0 that we should note for later.
pH follows a roughly normal distribution around 3.15.
Density seems to be distributed around 0.993. There seems to be a value well past the normal range.
There are 4,898 observations containing 12 features, 11 of which are chemical properties and 1 of which is a quality score from 0 to 10, 10 being the best. Most of the graphs appear to follow a roughly normal distribution, several with a right skew. This may be the reason why the quality scores also follow a roughly normal distribution.
The main feature is quality, followed by some measure of acidity. From the univariate plots, we can see that certain variables have extreme values, so those are key points we want to look at, especially since quality is distributed around 6 with few extremes.
A large portion of the investigation will compare fixed and volatile acidity, along with citric acid and residual sugar levels. I believe these affect the taste of the wine the most.
From the univariate data, I was not able to see a reason to create any new variables. The wines data dictionary states that total.sulfur.dioxide is a combination of the free and bound forms of SO2; however, we do not know the correct calculation to create variables from it.
This dataset was clean. I did have to adjust some graphs to show the extremes of certain variables, such as chlorides, but other than that, the values of each variable were consistent in how they should be presented.
First, let’s look at how the variables relate to one another.
From this, we can see that no single chemical property clearly indicates the quality of a wine. There is no strong correlation between any of the 11 variables and quality. The strongest is alcohol content, with a correlation of 0.436.
While it is interesting that wines with a quality of 9 all have an alcohol content greater than 10, we cannot draw a conclusion from such a small sample size. This graph does not tell us much other than that wines of each quality range widely in alcohol content.
From this box plot, we can more clearly see that alcohol alone does not cleanly separate the quality levels. The median alcohol content for a quality of 3 and for a quality of 6 both fall around 10.5.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.55 10.45 10.35 11.00 12.60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.40 12.40 12.50 12.18 12.70 12.90
Specifically, wines with a quality of 3 have a median alcohol content of 10.45 and wines with a quality of 6 have a median of 10.5. Wines with a quality of 9 have a median of 12.5 and a mean of 12.18, but this does not show that higher alcohol content means higher quality. There are only five values in this subset, and the mean falls below the box of the boxplot because it is pulled down by a single low value of 10.4. With so few scores of 9, these values may not be representative of the whole; more wines of this quality could change the picture.
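The effect of one low value on such a small group can be checked by hand. The sketch below assumes the quality 9 subset holds five wines; in that case the minimum, quartiles, median, and maximum printed by summary() are the five observations themselves, and their mean matches the reported 12.18.

```r
# Assumption: five observations, read off the summary for quality 9
# (min, 1st quartile, median, 3rd quartile, max).
alcohol_q9 <- c(10.4, 12.4, 12.5, 12.7, 12.9)

print(summary(alcohol_q9))

# Mean of all five values versus the mean without the single low value
mean_all <- mean(alcohol_q9)
mean_without_low <- mean(alcohol_q9[alcohol_q9 > 11])
median_all <- median(alcohol_q9)

print(c(mean_all = mean_all,
        mean_without_low = mean_without_low,
        median_all = median_all))
```

Dropping the 10.4 value moves the mean from 12.18 to 12.625 while the median barely changes, which is why the median is the safer summary for a group this small.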
cor(wines$quality, wines$fixed.acidity)
## [1] -0.1136628
cor(wines$quality, wines$volatile.acidity)
## [1] -0.194723
cor(wines$quality, wines$citric.acid)
## [1] -0.009209091
Using cor(), we can see how the various acid and acidity levels in the wines relate to quality. From the top: fixed acidity, volatile acidity, and citric acid.
Our previous assumption that the acidity of a wine would drive its quality seems to be faltering. The thought was that as acidity grew, quality would decrease. With correlations of about -0.11 and -0.19, we could say we were heading in the right direction, but the data dictionary only states that taste degrades “in high levels” of acidity.
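The three cor() calls can also be written as one pass over the acid columns. This is a minimal sketch using a synthetic stand-in for the `wines` data frame (random values, so the printed correlations will not match the ones above); only the column names are taken from the report.

```r
set.seed(42)
n <- 200

# Synthetic stand-in for the wines data frame, not the real measurements.
wines <- data.frame(
  fixed.acidity    = rnorm(n, mean = 6.9,  sd = 0.8),
  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.10),
  citric.acid      = rnorm(n, mean = 0.33, sd = 0.12),
  quality          = sample(3:9, n, replace = TRUE)
)

# Correlate each acid column with quality in one call
acid_cols <- c("fixed.acidity", "volatile.acidity", "citric.acid")
acid_cor <- sapply(wines[acid_cols], function(x) cor(x, wines$quality))
print(round(acid_cor, 3))
```

The result is a named vector, which makes it easy to sort or compare the three coefficients side by side.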
Since it’s only high levels, let’s take a look at the various acidity levels above a certain threshold.
Taking a look at the top 10% of fixed acidity, we do see a slight trend where wines of a higher quality have lower and lower mean fixed acidity.
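A top 10% subset like this one can be built with quantile(). The sketch below is one possible way to do it, on a synthetic stand-in for `wines`; whether the original analysis used `>=` or `>` at the cutoff is not stated, so `>=` is an assumption.

```r
set.seed(7)
n <- 500

# Synthetic stand-in for the wines data frame, not the real measurements.
wines <- data.frame(
  fixed.acidity = rnorm(n, mean = 6.9, sd = 0.8),
  quality       = sample(4:8, n, replace = TRUE)
)

# Keep only wines at or above the 90th percentile of fixed acidity
cutoff <- quantile(wines$fixed.acidity, 0.90)
top_acid <- wines[wines$fixed.acidity >= cutoff, ]

# Mean and median fixed acidity for each quality score in the subset
mean_by_quality   <- tapply(top_acid$fixed.acidity, top_acid$quality, mean)
median_by_quality <- tapply(top_acid$fixed.acidity, top_acid$quality, median)
print(round(rbind(mean = mean_by_quality, median = median_by_quality), 2))
```

Comparing the two rows across quality scores is the numeric counterpart of reading the means and medians off the boxplot.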
Here, we look at the full range of fixed acidity. While it follows the same trend, the trend looks less clear because the medians and means settle around 7.5. Let’s do the same for volatile acidity and citric acid.
Like its name, volatile acidity is truly volatile in its results. There is no clear trend in how volatile acidity affects quality.
Unfortunately, the same holds true for citric acid and quality of wine. There is not much of a correlation, and these boxplots show that. Here, we expected quality to increase as citric acid increased, since it is supposed to add flavor to wines. However, it seems that the amount of citric acid does not help.
It looks as if there is a trend appearing. Let’s make it easier to see clusters by removing the top percentages.
We can see an upward trend: as free sulfur dioxide increases, so does total sulfur dioxide. We expect this from the description of free and total sulfur dioxide.
Looking at this initially, we can see a clear trend: as the amount of alcohol increases, the density of the wine decreases.
If we limit our view to the bottom 99% of density values, we can definitely see the trend. Chemically this makes sense, since alcohol is less dense than water, and it reinforces the connection between water levels and alcohol.
Taking a quick glance, it doesn’t look like there’s much between pH levels and the density of the wine.
This graph further shows the lack of a relationship between pH and density. While they may not relate to each other now, together they could possibly show a trend for the quality of wines.
This is surprising since this relationship was not noted in the data dictionary but this does chemically make sense since the more sugar you add to a liquid, the denser the product gets. Let’s take a more in depth look.
## [1] 0.8389665
We can more clearly see the relationship between residual sugar and density. Using the cor() function, we can see a correlation of 0.839 between the two. Individually, they may not show a correlation with quality, but a multivariate analysis may reveal one.
A big observation was that no single variable correlated strongly with the quality of wine. This makes sense because otherwise, wine makers would have an easier time creating good wine. We have this multitude of variables because it takes all of them to make good wine.
It was also nice to see confirmation of relationships that the data dictionary stated, such as acidity. It was surprising, however, that acidity did not affect wine quality as much as I expected. I was also surprised that residual sugar did not have a stronger correlation with quality.
I think the most interesting relationship I found was between density and residual sugar levels. It makes sense when it is given some thought, but initially it may be hard to see.
With a correlation of 0.839, density and residual.sugar had the strongest relationship. Individually, compared with quality, neither shows a strong relationship, but I believe that together they may be able to provide guidance.
When we add the quality of each point to the density vs residual sugar graph, we can see an interesting separation. The points in the lower half are mostly wines of high quality. Very few low quality wines are in the lower half. It makes sense since we would like white wines to have a lighter taste. When sugar levels are low, we make up for quality through the lightness of the drink. As sugar levels rise, density and quality become muddled and there is less of a clear distinction. The regression lines depict this as well. Qualities of 7, 8, and 9 all start from a low density, while lower quality wines start from a higher density.
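The per-quality regression lines can also be inspected numerically by fitting one line per quality score and comparing intercepts, which correspond to where each line "starts". This is a rough sketch on synthetic data (the `wines` stand-in and its coefficients are assumptions for illustration), using base R's lm() rather than whatever plotting layer drew the lines in the report.

```r
set.seed(11)
n <- 300

# Synthetic stand-in for the wines data frame, not the real measurements.
wines <- data.frame(
  residual.sugar = runif(n, min = 1, max = 20),
  quality        = sample(5:7, n, replace = TRUE)
)
wines$density <- 0.990 + 0.0004 * wines$residual.sugar +
  rnorm(n, mean = 0, sd = 0.0005)

# Fit density ~ residual.sugar separately for each quality score
fits <- lapply(split(wines, wines$quality), function(d) {
  coef(lm(density ~ residual.sugar, data = d))
})

# One row per quality: intercept (starting density) and slope
coef_table <- do.call(rbind, fits)
print(coef_table)
```

On the real data, a lower intercept for the higher quality rows would match the observation that those lines start from a lower density.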
We saw earlier that as alcohol increases, so does the quality of wine, which is shown here. We can also see that as density decreases, quality tends to rise. This follows from the earlier comparison of density and alcohol content. What is interesting is the fade-off of wines of quality 5 and lower past 11% alcohol content and below 0.99 density. These lower quality wines dominate when alcohol content is below 10% but then seem to disappear.
The initial graph showed that pH and density did not have much of a correlation. With quality added, it’s interesting to see that the lower quality wines group around the same area instead of being evenly distributed.
The linear models of free against total sulfur dioxide, fitted for each quality, vary widely, and the 95% confidence bands are wide for qualities of 3, 4, 8, and 9.
Looking at the graphs, we can see that wines with a quality of less than 5 tend to have lower free sulfur dioxide. A large majority is below 40, with fewer and fewer as free sulfur dioxide grows. Wines with a quality of 5 or 6 have a wide spread in both free sulfur dioxide and total sulfur dioxide. It looks like the ratio of total sulfur dioxide to free sulfur dioxide plays a role in quality, since wines with lower ratios tend to be a 6. Wines with a quality of 7 or higher tend not to have total sulfur dioxide below 50 or free sulfur dioxide below 20.
For the lower quality wines, it looks as if our assumption holds: wines with a lower ratio of total to free sulfur dioxide tend to have a higher quality. Maybe comparing sulfur dioxide to other variables will provide clearer insight.
## [1] -0.2191773
Using with(wines, cor(total.free, quality)) shows that the ratio correlates with quality better than either variable does by itself: total.free to quality is -0.219, compared to -0.175 for total.sulfur.dioxide and 0.00816 for free.sulfur.dioxide.
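The report does not show how total.free was built. Its name suggests total sulfur dioxide divided by free sulfur dioxide, so the sketch below assumes that definition. The `wines` data frame here is a synthetic stand-in, so the printed correlation will not match -0.219.

```r
set.seed(3)
n <- 300

# Synthetic stand-in: total is free plus a positive 'bound' amount.
free_so2  <- runif(n, min = 5,  max = 80)
bound_so2 <- runif(n, min = 40, max = 150)
wines <- data.frame(
  free.sulfur.dioxide  = free_so2,
  total.sulfur.dioxide = free_so2 + bound_so2,
  quality              = sample(3:9, n, replace = TRUE)
)

# Assumed definition of the ratio variable: total / free
wines$total.free <- wines$total.sulfur.dioxide / wines$free.sulfur.dioxide

# Same call as in the report
ratio_cor <- with(wines, cor(total.free, quality))
print(ratio_cor)
```

Because total always includes the free portion, the ratio stays above 1; a value close to 1 means almost all of the sulfur dioxide is in free form.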
This graph was a surprising find since it shows a small grouping of qualities 5, 6, and 7. A cluster of qualities of 5 stay below 10% alcohol content, qualities of 6 are above 10% and below 11.5% and qualities of 7 are above 11.5%. There are of course values that break this trend but the clustering of colors shows distinct patterns. Does the ratio of alcohol to sulphates correlate strongly with qualities of wine?
## [1] 0.1753643
Unfortunately, the ratio of alcohol/sulphates does not correlate better than alcohol by itself. The correlation went from 0.436 to 0.175.
While we would not be able to see a trend between pH and chlorides alone, when we overlay quality we can see a slight trend: higher quality wines tend to have lower chlorides, while medium quality wines can be seen to have higher chlorides with a wider spread of pH.
Using a simple boxplot, we can see a small trend of average pH/chlorides increasing as the quality of wine increases.
It was easier to see how quality affected the graphs where neither x nor y was quality. My first and favorite was residual sugar vs density. You could clearly see a line where quality changed based on the level of density vs residual sugar. It was clear to see how qualities would cluster together depending on the variables.
Other features of interest, such as alcohol, were strengthened by features that relate strongly to them, for example density, since their values depend on each other. Here, we could see where quality was clearly defined in the density vs alcohol graphs. By contrast, pairing alcohol with variables that do not relate to it gave less clear results.
An interesting feature was how density and residual sugar affected quality of wines. I think seeing the clear line was very exciting for me and enabled me to seek out other comparable features.
Another surprising one was how the total.sulfur.dioxide vs. total/free sulfur dioxide graph looked. I expected something structural like the free.sulfur.dioxide vs. total/free but it looked very seashell like. I think it was interesting to see how some values ended up making a diagonal line in varying slopes. It definitely was not how I expected the graph to look.
I find this boxplot interesting because, while it is simple, it quickly and easily supports the idea that the more acidic a wine is, the lower its quality. We can see a trend of the medians and overall quantiles decreasing as the quality increases. I have added the mean for each quality, and it shows a similar trend of decreasing as the quality increases.
A key point about this graph is that it only represents the wines with a fixed.acidity value in the top 10%. This is done to strengthen the visualization of change between each quality rank and its fixed acidity level. Without this restriction, there is still a trend, but the graph displays it on a much smaller scale.
This scatterplot was my favorite during this analysis because it showed the starkest contrast in quality and in how the variables could affect quality. First, this scatterplot shows how density and residual sugar relate to each other. As residual sugar levels rise, so does density. What makes this visualization stand out is how separated wines with quality <= 5 are from wines with quality >= 6. Of course there are one-off instances that fall in a different area, but overall, the clusters of qualities add to the effect of the graph.
I also added a linear model. I expected the trend line as shown, but not such a clear wall between the two qualities (5 and 6) of wines. As residual sugar levels go higher, the two ‘sides’ do meet at the ‘point’ of the graph.
The visualization is limited to only the bottom 99.9% since there are extreme outliers that extend the limits past a decent view point.
This scatterplot is interesting because I think viewing the relationship between alcohol and quality is clearer this way. Sulphates, in this graph, could have actually been another variable and still would have displayed the same idea. The graph of alcohol and sulphates, however, does represent this in such a nice manner where each value lines up on a grid. I changed the shapes to be squares so that it aligns nicer than circles.
From here, we can see that wines of quality 5 usually have an alcohol content of less than 10%. Wines of quality 7 tend to have an alcohol content of more than 11%, and a lot of the wines of quality 8 tend to have an alcohol content of more than 12%. This is in line with the fact that alcohol has the highest correlation with quality of all the variables. As stated before, if we replaced sulphates with most other variables, the graph would still show the same increase in quality as alcohol increases.
In the White Wines data set, I expected to find a clear-cut way to decide whether a wine was of high quality, but that was definitely not the case. I was surprised to see that alcohol content did slightly correlate with quality. It was also nice to see how the graphs came together. I also found that the R documentation ended up helping a lot with creating graphs. I think I’ll take what I learned and be more consistent with my graphs from now on.
A lot of the struggle was figuring out which variables would work best. If I had the bandwidth, it would have been interesting to look at ratios of each pair of variables, or at other ways of combining variables to predict quality, such as products or sums. I think I could also try to create a linear model, but that goes back to figuring out which variables would go well together.
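As a starting point for that future work, a linear model in base R might look like the sketch below. The choice of predictors (alcohol, density, residual sugar) is only a guess based on the relationships noted in this report, and the `wines` data frame is a synthetic stand-in, so the coefficients carry no meaning for real wines.

```r
set.seed(21)
n <- 400

# Synthetic stand-in for the wines data frame, not the real measurements.
wines <- data.frame(
  alcohol        = runif(n, min = 8, max = 14),
  residual.sugar = runif(n, min = 1, max = 20)
)
wines$density <- 0.995 - 0.0008 * wines$alcohol +
  0.0004 * wines$residual.sugar + rnorm(n, mean = 0, sd = 0.0005)
wines$quality <- round(3 + 0.3 * wines$alcohol + rnorm(n, mean = 0, sd = 0.8))

# Hypothetical model: quality explained by three of the variables discussed
fit <- lm(quality ~ alcohol + density + residual.sugar, data = wines)
print(coef(fit))

# Share of variance in quality explained by the model
r_squared <- summary(fit)$r.squared
print(r_squared)
```

Comparing r_squared across different sets of predictors would be one simple way to judge which variables go well together.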